112 ◾ Bioinformatics
1110696, 1230237, and 1234567. The first two variants are single point substitutions. The
third record has two alternate alleles (G and T) in place of the ref nucleotide (A). In the
fourth record the alt allele is missing (“.”); under the VCF specification this marks a site
where no alternate allele was called (a monomorphic reference site), not a deletion. In the
fifth record, there are two alternate alleles: the first is a deletion of two nucleotides (T
and C) and the second is an insertion of a single nucleotide T.
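The reading of the REF and ALT columns described above can be sketched in a few lines of Python. This is a minimal illustration, not a full variant classifier: the function names are hypothetical, only simple cases are handled, and a missing ALT (".") is reported as "no alternate allele", following the VCF specification.

```python
def classify_allele(ref, alt):
    """Classify one ALT allele relative to REF (simple cases only)."""
    if alt == ".":
        # Missing ALT: no alternate allele was called at this site.
        return "no alternate allele"
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"
    if len(alt) < len(ref) and ref.startswith(alt):
        # REF carries extra bases that are absent from ALT.
        return "deletion of " + ref[len(alt):]
    if len(alt) > len(ref) and alt.startswith(ref):
        # ALT carries extra bases that are absent from REF.
        return "insertion of " + alt[len(ref):]
    return "complex"


def classify_record(ref, alt_field):
    """Classify every comma-separated allele in the ALT column."""
    return [classify_allele(ref, alt) for alt in alt_field.split(",")]


print(classify_record("A", "G,T"))
print(classify_record("GTC", "G,GTCT"))
```

With REF = GTC and ALT = G,GTCT, the first allele is reported as a deletion of TC and the second as an insertion of T, which matches the description of the fifth record.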
The QUAL column holds the Phred-scaled quality score of the call at each position. The
FILTER column records the outcome of filtering: PASS if the record passed all filters, or
otherwise the keywords of the filters it failed. The keywords in this column can be used to
filter the variants, as we will discuss later. The second row (position 17330) does not pass
the filter that requires a Phred quality score of more than 10.
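As a rough sketch of how the FILTER column can be used, the Python below keeps only the records that did not fail any filter. The two tab-separated records are illustrative values modelled on the example in the VCF specification and are assumed, not copied, to resemble the figure; the function names are hypothetical. Treating "." (no filtering applied) as passing is a choice made for this sketch.

```python
def failed_filters(record_line):
    """Return the filters a VCF data line failed (empty list if none)."""
    fields = record_line.rstrip("\n").split("\t")
    filter_value = fields[6]            # FILTER is the 7th column
    if filter_value in ("PASS", "."):   # "." means no filtering was applied
        return []
    return filter_value.split(";")      # several failed filters are ;-separated


def keep_passing(record_lines):
    """Keep only the data lines that failed no filter."""
    return [line for line in record_lines if not failed_filters(line)]


# Illustrative records (assumed values): CHROM POS ID REF ALT QUAL FILTER INFO
records = [
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2",
    "20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017",
]

for line in keep_passing(records):
    print(line.split("\t")[1])          # position of each retained record
```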
The INFO column includes position-level information for that data row and can be
thought of as aggregate data that summarizes the sample-level information specified.
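An INFO value is a semicolon-separated list of entries, each either a key=value pair or a bare flag. A minimal Python sketch for turning it into a dictionary follows; the function name and the sample string are assumptions for illustration, and values are kept as lists of strings rather than converted to the types declared in the metadata section.

```python
def parse_info(info_field):
    """Parse an INFO string into a dict; flags without a value map to True."""
    info = {}
    if info_field == ".":               # "." means the INFO column is empty
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        # A key may carry several comma-separated values (one per alt allele).
        info[key] = value.split(",") if sep else True
    return info


# Illustrative INFO string (assumed values)
info = parse_info("NS=2;DP=10;AF=0.333,0.667;AA=T;DB")
print(info["AF"])
print(info["DB"])
```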
The FORMAT column specifies the sample-level fields to expect under each sample.
Each row has the same format fields (GT, GQ, DP, and HQ) except for the last row, which
does not have HQ. Each of these fields is described in the metadata section. GT (Genotype)
lists the sample's alleles as indices (0 for the ref allele, 1 for the first alt allele, and
so on), separated by / if the genotype is unphased or by | if it is phased. GQ is the
Genotype Quality, a single integer; DP is the Read Depth, a single integer; and HQ is the
Haplotype Quality, given as two integers separated by a comma.
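The pairing of the FORMAT keys with the colon-separated values in a sample column can be sketched as follows. The function name and the example strings are assumptions for illustration; values are left as strings, and a record without HQ is handled simply by the key being absent.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's values and unpack GT and HQ."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    call = dict(zip(keys, values))      # trailing fields may be omitted
    genotype = call.get("GT", ".")
    call["phased"] = "|" in genotype    # | means phased, / means unphased
    call["alleles"] = genotype.replace("|", "/").split("/")
    if "HQ" in call:
        call["HQ"] = call["HQ"].split(",")   # two haplotype qualities
    return call


# Illustrative FORMAT and sample strings (assumed values)
phased_call = parse_sample("GT:GQ:DP:HQ", "1|0:48:8:51,51")
unphased_call = parse_sample("GT:GQ:DP", "0/1:35:4")
print(phased_call["alleles"], phased_call["phased"])
print(unphased_call["alleles"], unphased_call["phased"])
```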
This VCF file has three samples identified by their names (NA00001, NA00002, and
NA00003) in columns 10 through 12.
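Because the sample names sit in the header line from column 10 onward, they can be read off and matched to the sample columns of any data row. The Python below is a small sketch under that assumption; the function names are hypothetical and the header and record strings are illustrative values modelled on the example in the VCF specification.

```python
def sample_names(header_line):
    """Return the sample names from the #CHROM header line (columns 10+)."""
    columns = header_line.rstrip("\n").lstrip("#").split("\t")
    return columns[9:]                  # the first nine columns are fixed


def calls_by_sample(header_line, record_line):
    """Map each sample name to its raw sample column in one data row."""
    fields = record_line.rstrip("\n").split("\t")
    return dict(zip(sample_names(header_line), fields[9:]))


# Illustrative header and record (assumed values)
header = ("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
          "\tNA00001\tNA00002\tNA00003")
record = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
          "\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")

print(sample_names(header))
print(calls_by_sample(header, record)["NA00002"])
```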
Genetic variants discovered by researchers are submitted, usually in VCF files, to
databases that archive the genetic differences together with related information.
These databases collect, organize, and publicly document the evidence supporting links
between genetic variants and diseases or conditions. The variants are usually submitted
with their assertions, which are informed assessments of the association, or lack of
association, between a disease or condition and a genetic variant based on the current
state of knowledge. The variant databases include dbSNP (for human variants shorter than
50 base pairs), dbVar (for human variants longer than 50 base pairs), and the European
Variation Archive (for variants of all species).
A variant submitted to a database is given a unique identifier that can be used to find
that variant and its related information in the database because such identifiers are unambiguous,
FIGURE 4.1 VCF file showing metadata and data sections.